In this report, we analyze a COVID-19 data set from the World Health Organization (WHO). The data set contains more than 100,000 records worldwide, from 2020-01 onward. Our question of interest is the relationship between new deaths and new cases, and we construct several derived data sets to study it. The report includes descriptive analysis, inferential analysis and sensitivity analysis, followed by a discussion of the results.
Due to COVID-19, the whole world is suffering from a burdensome pandemic. Millions of people have been infected, and more than 2 million people have lost their lives. Many countries still face huge pressure from the epidemic, not only in public health but also in economics and politics. To help the public understand the current situation, we describe the pandemic through data and look for possible causal relationships between deaths and the number of people infected. We hope that our project can serve as a reference that helps the general public understand the epidemic more objectively and helps governments and policymakers respond to it more easily.
In this project, we focus on the number of confirmed cases and the number of deaths, including both cumulative counts and new daily counts. To help the public understand the current situation of the epidemic and to look for possible causal relationships in the data, we try to answer the following questions of interest.
How do the numbers of new cases and new deaths change over time in each region and in the whole world? How can we evaluate this trend?
Are there any differences in monthly new cases and new deaths between regions?
How can we describe the relationship between new cases and new deaths?
The results of our project may be a useful reference for the public to understand the current situation of the epidemic and how it has developed over time. Moreover, our project may help governments make policies to control the epidemic and improve the situation of public health.
In this project, the analysis uses worldwide COVID-19 data from the WHO data set, covering January 2020 through February 2021.
The original data set contains 100725 observations and 8 variables. Each observation records information for one country on one day, so the data set is large and complex. We therefore select the variables WHO_region, New_cases and New_deaths to form a new data set. First, we group the countries into 7 regions according to the WHO standard. The purpose of this partition is to show the trend more clearly within each region, whose countries share similar conditions such as culture and geographic location. There are 50 countries in Africa, 56 in the Americas, 22 in the Eastern Mediterranean, 62 in Europe, 11 in South-East Asia and 35 in the Western Pacific; only 1 country falls in the other region. For convenience and clarity, we consider only the first six regions. Then, to reduce the number of observations, we sum New_cases and New_deaths for each region and each month. We use monthly data to make changes in trend more visible: in the original daily data, new cases and new deaths change little from one day to the next, which is hard to see in a chart or a model.
In this way, we form the new data set on which the follow-up analysis relies. The new data set has 84 observations and 6 variables. “month” is the month of each observation, from 2020-01 to 2021-02. “WHO_region” is the region of each observation. “New_cases” is the total number of new cases in a given region in a given month. “New_deaths” is the total number of new deaths in a given region in a given month. “Cumulative_cases” is the total number of cases in a given region up to a given month. “Cumulative_deaths” is the total number of deaths in a given region up to a given month.
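The aggregation described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy values are made up, the date column name `Date_reported` is assumed rather than taken from the report, and the cumulative columns are computed here as running sums of the monthly totals, which may differ from how the report derived them.

```r
# Toy daily data in the shape of the WHO file; all values are made up.
# The column name Date_reported is an assumption for this sketch.
daily <- data.frame(
  Date_reported = as.Date(c("2020-03-01", "2020-03-02", "2020-04-01",
                            "2020-03-01", "2020-03-15", "2020-04-10")),
  WHO_region = c("EURO", "EURO", "EURO", "AMRO", "AMRO", "AMRO"),
  New_cases  = c(10, 20, 5, 7, 3, 40),
  New_deaths = c(1, 0, 2, 0, 1, 4)
)

# Month label such as "2020-03"
daily$month <- format(daily$Date_reported, "%Y-%m")

# Sum new cases and new deaths within each region-month cell
monthly <- aggregate(cbind(New_cases, New_deaths) ~ WHO_region + month,
                     data = daily, FUN = sum)

# Running totals within each region, in chronological order
monthly <- monthly[order(monthly$WHO_region, monthly$month), ]
monthly$Cumulative_cases  <- ave(monthly$New_cases,  monthly$WHO_region, FUN = cumsum)
monthly$Cumulative_deaths <- ave(monthly$New_deaths, monthly$WHO_region, FUN = cumsum)
print(monthly)
```

With six regions and fourteen months, the same steps would give the 84 region-month rows described above.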
Many related studies have used similar or different data sets, and their conclusions offer useful inspiration for this analysis. For example, Chatters, Taylor and Taylor (2020) found that “Black people and older adults are the two groups most affected by COVID-19 morbidity and mortality.” Causey, Harnack-Eber, Huie, Lang, Liu, Ryu and Shapiro (2020) found that “Black, Hispanic, and indigenous populations in the U.S. have seen disproportionately high COVID-19 cases and virus-related deaths compared to Whites.”
As explained above, we do not analyze the data day by day; we work with monthly data instead. First, we check the summary of new cases, new deaths, cumulative cases and cumulative deaths by month.
| | 2020-01 (N = 6757) | 2020-02 (N = 6757) | 2020-03 (N = 7223) | 2020-04 (N = 6990) | 2020-05 (N = 7223) | 2020-06 (N = 6990) | 2020-07 (N = 7223) | 2020-08 (N = 7223) | 2020-09 (N = 6990) | 2020-10 (N = 7223) | 2020-11 (N = 6990) | 2020-12 (N = 7223) | 2021-01 (N = 7223) | 2021-02 (N = 6524) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| New_cases | ||||||||||||||
| minimum | -4.00 | 0.00 | -4.00 | -1,878.00 | -2,461.00 | -766.00 | -1.00 | -1,385.00 | -8,086.00 | -5.00 | -32,952.00 | -1.00 | -1.00 | 0.00 |
| median (IQR) | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | 0.00 (0.00, 7.00) | 4.00 (0.00, 63.00) | 4.00 (0.00, 82.00) | 7.00 (0.00, 149.75) | 12.00 (0.00, 262.00) | 27.00 (0.00, 271.00) | 30.00 (0.00, 319.00) | 30.00 (0.00, 580.50) | 39.00 (0.00, 974.00) | 53.00 (0.00, 988.50) | 80.00 (0.00, 830.50) | 56.00 (0.00, 775.00) |
| mean (sd) | 1.46 ± 44.94 | 11.23 ± 236.75 | 97.72 ± 750.92 | 331.61 ± 2,023.73 | 392.54 ± 2,059.03 | 605.91 ± 3,079.07 | 963.02 ± 5,412.95 | 1,143.25 ± 6,406.12 | 1,230.83 ± 7,013.34 | 1,690.03 ± 7,008.43 | 2,450.79 ± 11,000.90 | 2,633.71 ± 14,425.23 | 2,752.38 ± 16,464.47 | 1,715.96 ± 7,507.70 |
| maximum | 1,984.00 | 15,152.00 | 20,341.00 | 38,509.00 | 41,411.00 | 54,771.00 | 74,354.00 | 157,273.00 | 194,121.00 | 150,573.00 | 193,734.00 | 402,270.00 | 473,093.00 | 141,327.00 |
| Cumulative_cases | ||||||||||||||
| minimum | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| median (IQR) | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | 5.00 (0.00, 67.50) | 194.00 (16.00, 1,604.75) | 673.00 (60.00, 5,555.50) | 1,176.00 (117.00, 11,449.75) | 2,028.00 (190.50, 19,564.50) | 3,266.00 (336.00, 35,150.00) | 4,891.00 (497.00, 46,293.50) | 7,564.00 (566.00, 61,832.00) | 9,992.00 (741.00, 98,603.25) | 13,143.00 (1,011.00, 136,594.00) | 17,553.00 (1,541.00, 167,667.00) | 23,157.50 (2,126.00, 200,101.75) |
| mean (sd) | 5.57 ± 182.60 | 241.91 ± 3,915.31 | 1,166.88 ± 7,903.09 | 8,525.91 ± 47,324.48 | 19,306.48 ± 100,792.79 | 34,137.27 ± 160,233.09 | 58,070.19 ± 278,823.29 | 91,765.21 ± 446,867.62 | 127,727.86 ± 616,634.05 | 170,410.30 ± 794,312.95 | 235,427.18 ± 1,020,779.67 | 314,119.38 ± 1,373,090.27 | 399,118.04 ± 1,790,747.70 | 464,443.97 ± 2,082,142.97 |
| maximum | 9,720.00 | 79,389.00 | 140,640.00 | 1,003,974.00 | 1,734,040.00 | 2,537,636.00 | 4,388,566.00 | 5,899,504.00 | 7,077,015.00 | 8,852,730.00 | 13,082,877.00 | 19,346,790.00 | 25,676,612.00 | 28,174,978.00 |
| New_deaths | ||||||||||||||
| minimum | 0.00 | 0.00 | -2.00 | 0.00 | -514.00 | -31.00 | -13.00 | -10.00 | -2.00 | -5.00 | 0.00 | -1.00 | -4.00 | 0.00 |
| median (IQR) | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | 0.00 (0.00, 2.00) | 0.00 (0.00, 2.00) | 0.00 (0.00, 2.00) | 0.00 (0.00, 3.00) | 0.00 (0.00, 4.00) | 0.00 (0.00, 4.00) | 0.00 (0.00, 6.00) | 0.00 (0.00, 13.00) | 0.00 (0.00, 16.00) | 1.00 (0.00, 16.00) | 1.00 (0.00, 13.00) |
| mean (sd) | 0.03 ± 1.01 | 0.40 ± 6.97 | 5.04 ± 46.06 | 26.59 ± 166.95 | 19.69 ± 132.85 | 19.25 ± 106.93 | 22.56 ± 119.79 | 25.10 ± 135.96 | 22.93 ± 125.30 | 25.34 ± 113.37 | 38.90 ± 133.90 | 47.23 ± 195.13 | 57.23 ± 285.65 | 47.04 ± 230.58 |
| maximum | 49.00 | 252.00 | 971.00 | 6,409.00 | 5,000.00 | 2,516.00 | 3,876.00 | 3,935.00 | 3,850.00 | 3,351.00 | 2,248.00 | 3,443.00 | 8,193.00 | 5,512.00 |
| Cumulative_deaths | ||||||||||||||
| minimum | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| median (IQR) | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | 4.00 (0.00, 39.00) | 13.00 (1.00, 122.00) | 24.00 (2.00, 253.00) | 42.00 (3.00, 364.00) | 63.00 (5.00, 576.50) | 82.00 (10.00, 744.00) | 113.00 (11.00, 950.00) | 137.00 (13.00, 1,483.75) | 169.00 (17.00, 2,251.00) | 251.00 (21.00, 2,885.00) | 328.00 (25.00, 3,246.00) |
| mean (sd) | 0.13 ± 4.26 | 6.94 ± 118.69 | 50.56 ± 453.17 | 579.35 ± 3,305.46 | 1,296.18 ± 6,883.30 | 1,866.67 ± 9,234.34 | 2,495.18 ± 11,498.47 | 3,258.76 ± 14,579.63 | 3,994.84 ± 17,557.72 | 4,716.76 ± 20,239.93 | 5,669.98 ± 23,063.62 | 7,024.77 ± 27,461.19 | 8,609.70 ± 33,539.50 | 10,237.09 ± 40,240.64 |
| maximum | 213.00 | 2,838.00 | 11,591.00 | 57,730.00 | 102,640.00 | 126,203.00 | 150,054.00 | 181,689.00 | 203,875.00 | 227,178.00 | 263,946.00 | 335,789.00 | 433,173.00 | 506,760.00 |
In the summary table, some minima of new cases and new deaths are negative, probably because cases and deaths were wrongly attributed to Covid-19 and the counts were revised when the mistake was found. The median and mean of new cases increased from 2020/01 to 2021/01, and both decreased from 2021/01 to 2021/02. In the table, the standard deviation of new cases peaked at 16,464.47 in 2021/01. The mean of new deaths showed a general tendency to increase from 2020/01 to 2021/02, but it fell back in 2020/04-2020/06, 2020/08-2020/09 and 2021/01-2021/02. In the table, the standard deviation of new deaths peaked at 285.65 in 2021/01.
In our project, we examine how monthly new cases and monthly new deaths change over time in different regions.
The time series plot of new cases shows the number of new cases in the six WHO regions from 2020/01 to 2021/02. New cases in AFRO went through two waves of slow growth, each followed by a decline; on the whole, the change in AFRO was smooth. New cases in WPRO changed more slowly than in AFRO. New cases in EMRO also went through two waves of slow growth, each followed by a decline, but the increases were sharper than in AFRO. In SEARO, new cases increased from 2020/01 to 2020/09 and decreased from 2020/09 to 2021/02. In EURO, new cases increased from 2020/01 to 2020/04 and from 2020/06 to 2020/11, and decreased from 2020/04 to 2020/06 and from 2020/11 to 2021/02. In AMRO, new cases increased from 2020/01 to 2020/08 and from 2020/09 to 2021/01, and decreased from 2020/08 to 2020/09 and from 2021/01 to 2021/02.
The time series plot of new deaths shows the number of new deaths in the six WHO regions from 2020/01 to 2021/02. New deaths in AFRO went through two waves of slow growth, each followed by a decline. New deaths in WPRO changed more slowly than in AFRO. New deaths in EMRO also went through two waves of slow growth, each followed by a decline, but the increases were sharper than in AFRO. In SEARO, new deaths increased from 2020/01 to 2020/09 and decreased from 2020/09 to 2021/02, which is consistent with the pattern of new cases in SEARO. In EURO, new deaths increased from 2020/01 to 2020/04 and from 2020/08 to 2021/01, and decreased from 2020/04 to 2020/08 and from 2021/01 to 2021/02. In AMRO, new deaths increased from 2020/01 to 2020/08 and from 2020/11 to 2021/01, and decreased from 2020/08 to 2020/11 and from 2021/01 to 2021/02.
In the plot of cumulative cases by country from 2020/01/03 to 2021/02/28, the United States had the highest number of cumulative cases (28174978), followed by India (11096731) and Brazil (10455630). We then explore the geographic distribution of cumulative deaths. In the plot of cumulative deaths by country over the same period, the United States again ranked first, Brazil second and India third. The United States, India and Brazil all have large populations, so it is unreasonable to judge the severity of Covid-19 by cumulative cases and cumulative deaths alone. To reduce the effect of population size, we use the death rate to see which country was in the worst situation. In the plot of death rate by country from 2020/01/03 to 2021/02/28, Yemen had the highest death rate (0.2780), Mexico the second highest (0.0888) and China the third highest (0.0475).

Between 2020/01 and 2020/02, the outbreak was most severe in China, where the number of new cases grew rapidly from 9724 to 69669 within a month. China then imposed compulsory Covid-19 prevention measures, and the effect was pronounced: the number of new cases dropped to 3152 in 2020/03. However, Covid-19 spread rapidly around the world; Spain, for example, had 108364 new cases in March. From 2020/03 to 2021/01, the map became redder and redder, especially in North America and in Asia & Pacific outside China. Most countries saw a decrease in February 2021.

Between 2020/01 and 2020/02, the number of new deaths grew rapidly from 213 to 2625 within a month. After China imposed compulsory Covid-19 prevention measures, the number of new deaths dropped to 476 in 2020/03.
At that time, more than 2000 people had died of Covid-19 in each of the United States, Spain, France and some other countries. Covid-19 then spread rapidly across regions: Canada and Mexico turned red like the United States, while South America was still relatively safe compared with other regions in April. From 2020/05 to 2021/02, the number of deaths varied across countries, but Russia, Brazil, the United States and some others stayed red throughout, with more than ten thousand deaths each month.

Between 2020/01 and 2020/02, the outbreak was concentrated in the Regional Office for the Western Pacific (WPRO), which accounted for more than 97% of new cases. In 2020/03, the number of new cases in WPRO decreased rapidly, while the Regional Office for Europe (EURO) accounted for 65.1% of the total across regions with 459642 new cases. Covid-19 then spread rapidly across all regional offices. From then until 2021/02, the Regional Office for the Americas (AMRO) accounted for the largest share of new cases in every month except 2020/10, when EURO had 8081456 new cases, accounting for 47.1%.
Between 2020/01 and 2020/02, new deaths were concentrated in the Regional Office for the Western Pacific (WPRO), which accounted for more than 97% of new deaths. In 2020/03, the number of new deaths in WPRO decreased rapidly from 2648 to 811. The Regional Office for Europe (EURO) then accounted for 80.91% of the total across regions with 29435 new deaths in 2020/03, and for 58.37% with 108540 new deaths in 2020/04. Later, Covid-19 spread rapidly in the Regional Office for the Americas (AMRO). From then until 2021/02, AMRO accounted for the largest share of new deaths in every month except 2020/10 and 2020/11, when EURO had more new deaths than AMRO.
Overall, the trends of new cases and new deaths appear similar.
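The regional percentages quoted in this section are shares of a month's total across regions. A minimal base R sketch of that calculation is given below; the numbers in the matrix are made up for illustration and are not the report's data.

```r
# Made-up monthly totals by region (rows) and month (columns)
new_cases <- matrix(c(900, 100,
                       50, 650,
                       50, 250),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(region = c("WPRO", "EURO", "AMRO"),
                                    month  = c("2020-02", "2020-03")))

# Share of each region within a month, in percent (each column sums to 100)
share <- 100 * prop.table(new_cases, margin = 2)
print(round(share, 1))

# Region with the largest share in each month
top_region <- rownames(share)[apply(share, 2, which.max)]
names(top_region) <- colnames(share)
print(top_region)
```

The same call applied to a region-by-month matrix of new deaths would give the corresponding shares of deaths.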
As discussed above, we are interested in the relationship between monthly new cases and monthly new deaths, but first we ask whether monthly new cases and new deaths differ between regions. We can use a two-way ANOVA model to answer this question.
Both monthly new cases and monthly new deaths appear to differ between regions, and the previous analysis shows that they also differ between months, so we fit a two-way ANOVA with region and month as factors.
\(Y_{it}=\mu_{..}+\alpha_i+\tau_t+\epsilon_{it}\)
\(Y_{it}\) stands for the new cases in region \(i\) in month \(t\), \(\alpha_i\) stands for the effect of region \(i\), and \(\tau_t\) stands for the effect of month \(t\). Each region-month cell has a single observation, so no replicate index is needed.
\(Z_{it}=\nu_{..}+\alpha_i+\tau_t+\epsilon_{it}\)
\(Z_{it}\) stands for the new deaths in region \(i\) in month \(t\), \(\alpha_i\) stands for the effect of region \(i\), and \(\tau_t\) stands for the effect of month \(t\). Each region-month cell has a single observation, so no replicate index is needed.
For each of the two models, we test whether the region and month effects are significant:
\(H_{\alpha0}: \alpha_i = 0 \text{ for all } i\)
\(H_{\alpha1}: \text{not all } \alpha_i \text{ are } 0\)
\(H_{\tau0}: \tau_t = 0 \text{ for all } t\)
\(H_{\tau1}: \text{not all } \tau_t \text{ are } 0\)
We can check the summary of the first ANOVA model:
## Df Sum Sq Mean Sq F value Pr(>F)
## WHO_region 5 1.518e+14 3.037e+13 12.615 1.43e-08 ***
## month 13 1.047e+14 8.055e+12 3.346 0.000604 ***
## Residuals 65 1.565e+14 2.407e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and the summary of the second ANOVA model:
## Df Sum Sq Mean Sq F value Pr(>F)
## WHO_region 5 8.634e+10 1.727e+10 18.047 3.48e-11 ***
## month 13 3.226e+10 2.482e+09 2.594 0.00585 **
## Residuals 65 6.219e+10 9.568e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the summaries of these two ANOVA models, the p-values of the F tests for both month and WHO_region are very small, so both WHO_region and month are statistically significant in the two models. This means that the numbers of new cases and new deaths differ across months and across regions.
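For readers who want to reproduce the structure of this test, the following is a minimal sketch of an additive two-way ANOVA in base R. The data are simulated and the effect sizes are arbitrary assumptions; only the layout (6 regions by 14 months, one observation per cell) mirrors the data set described in this report.

```r
set.seed(1)
regions <- c("AFRO", "AMRO", "EMRO", "EURO", "SEARO", "WPRO")
months  <- format(seq(as.Date("2020-01-01"), by = "month", length.out = 14), "%Y-%m")

# One simulated observation per region-month cell (6 x 14 = 84 rows)
d <- expand.grid(WHO_region = regions, month = months)
region_effect <- c(0, 50, 10, 40, 20, 5)[as.integer(d$WHO_region)]  # arbitrary
month_effect  <- seq(0, 26, by = 2)[as.integer(d$month)]            # arbitrary
d$New_cases <- 100 + region_effect + month_effect + rnorm(nrow(d), sd = 5)

# Additive model: region effect plus month effect, no interaction
fit <- aov(New_cases ~ WHO_region + month, data = d)
tab <- summary(fit)[[1]]
print(tab)
```

With one observation per cell there are no degrees of freedom left for an interaction term, which is why the sketch fits main effects only; the resulting table has 5, 13 and 65 degrees of freedom, as in the output above.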
As discussed in the ANOVA part, monthly new cases and monthly new deaths differ between regions, and the time series plots show that new cases and new deaths follow similar trends within each region. We therefore analyze the relationship between monthly new cases and monthly new deaths, treating region as a dummy variable. First, we check the scatter plot of monthly new deaths against monthly new cases.
In this plot, the rate of change of new deaths with respect to new cases seems to differ between regions. If we draw a regression line for each region, all of the lines seem to pass through the origin, which suggests fitting a model without an intercept and letting the region dummy variables affect the slope.
The model will be:
\(Z_j = \beta_1 Y_jR_1 +\beta_2 Y_jR_2+\beta_3 Y_jR_3+\beta_4 Y_jR_4+\beta_5 Y_jR_5+\beta_6 Y_jR_6+\epsilon_j\), \(\epsilon_j \sim N(0,\sigma^2)\ \text{i.i.d.}\)
\(Z_j\) stands for the monthly new deaths, \(Y_j\) stands for the monthly new cases, and \(R_1,R_2,\dots,R_6\) are the dummy variables for the six WHO regions, so that \(\beta_r\) is the slope for region \(r\). This matches the fitted formula RNew_deaths ~ RNew_cases:factor(WHO_region) - 1. We can check the summary of this model:
##
## Call:
## lm(formula = RNew_deaths ~ RNew_cases:factor(WHO_region) - 1,
## data = data_region)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56240 -540 252 2633 89068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## RNew_cases:factor(WHO_region)AFRO 0.025519 0.019135 1.334 0.186210
## RNew_cases:factor(WHO_region)AMRO 0.020437 0.001141 17.910 < 2e-16 ***
## RNew_cases:factor(WHO_region)EMRO 0.021999 0.009692 2.270 0.025985 *
## RNew_cases:factor(WHO_region)EURO 0.019700 0.001334 14.767 < 2e-16 ***
## RNew_cases:factor(WHO_region)SEARO 0.014404 0.004060 3.548 0.000661 ***
## RNew_cases:factor(WHO_region)WPRO 0.015761 0.036810 0.428 0.669699
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20080 on 78 degrees of freedom
## Multiple R-squared: 0.8775, Adjusted R-squared: 0.868
## F-statistic: 93.09 on 6 and 78 DF, p-value: < 2.2e-16
For the coefficients of this model taken together, we can conduct the test:
\(H_0: \beta_1 = \beta_2 = \dots = \beta_6 = 0\)
\(H_1: \text{not all } \beta_i \text{ are } 0\)
According to the model summary, the p-value of the F-statistic is very small (< 2.2e-16), so we reject \(H_0\) and consider the model statistically significant as a whole. For the individual coefficients, the slopes for AMRO, EMRO, EURO and SEARO are significant at the 0.05 level, while those for AFRO and WPRO are not.
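The no-intercept, region-specific-slope specification can be illustrated with a small simulation. The sketch below uses made-up data and assumed slope values (chosen only to be of a similar order of magnitude to the estimates above); it is not the report's data or code.

```r
set.seed(2)
regions <- c("AFRO", "AMRO", "EMRO", "EURO", "SEARO", "WPRO")
# Illustrative slopes (deaths per case); assumed values, not estimates
true_slope <- c(AFRO = 0.025, AMRO = 0.020, EMRO = 0.022,
                EURO = 0.020, SEARO = 0.014, WPRO = 0.016)

# 14 simulated months per region
d <- data.frame(WHO_region = factor(rep(regions, each = 14), levels = regions))
d$RNew_cases  <- runif(nrow(d), 1e4, 1e6)
d$RNew_deaths <- unname(true_slope[as.character(d$WHO_region)]) * d$RNew_cases +
  rnorm(nrow(d), sd = 500)

# "x:factor - 1" fits one slope per region and no intercept
fit <- lm(RNew_deaths ~ RNew_cases:WHO_region - 1, data = d)
est <- coef(fit)
names(est) <- sub("RNew_cases:WHO_region", "", names(est), fixed = TRUE)
print(round(est, 4))
```

Each coefficient can be read as the expected number of additional deaths per additional case in that region, under the assumption that the line passes through the origin.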
The model above does not account for the influence of time. We therefore fit a mixed-effects regression that treats month as a random effect, and check the summary of this mixed-effects model.
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: RNew_deaths ~ RNew_cases:factor(WHO_region) + (1 | month)
## Data: data_region
##
## REML criterion at convergence: 1922.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9120 -0.3573 -0.1562 0.0916 4.1560
##
## Random effects:
## Groups Name Variance Std.Dev.
## month (Intercept) 13778383 3712
## Residual 363247663 19059
## Number of obs: 84, groups: month, 14
##
## Fixed effects:
## Estimate Std. Error df t value
## (Intercept) 8.171e+03 3.453e+03 1.837e+01 2.366
## RNew_cases:factor(WHO_region)AFRO 3.788e-03 2.043e-02 7.285e+01 0.185
## RNew_cases:factor(WHO_region)AMRO 1.910e-02 1.233e-03 7.132e+01 15.497
## RNew_cases:factor(WHO_region)EMRO 9.961e-03 1.060e-02 7.350e+01 0.940
## RNew_cases:factor(WHO_region)EURO 1.836e-02 1.414e-03 7.174e+01 12.985
## RNew_cases:factor(WHO_region)SEARO 1.012e-02 4.328e-03 7.540e+01 2.337
## RNew_cases:factor(WHO_region)WPRO -2.927e-02 3.998e-02 7.282e+01 -0.732
## Pr(>|t|)
## (Intercept) 0.0291 *
## RNew_cases:factor(WHO_region)AFRO 0.8535
## RNew_cases:factor(WHO_region)AMRO <2e-16 ***
## RNew_cases:factor(WHO_region)EMRO 0.3503
## RNew_cases:factor(WHO_region)EURO <2e-16 ***
## RNew_cases:factor(WHO_region)SEARO 0.0221 *
## RNew_cases:factor(WHO_region)WPRO 0.4664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) RN_:(WHO_)AF RN_:(WHO_)AM RN_:(WHO_)EM RN_:(WHO_)EU
## RN_:(WHO_)AF -0.426
## RN_:(WHO_)AM -0.447 0.217
## RN_:(WHO_)EM -0.468 0.223 0.235
## RN_:(WHO_)EU -0.411 0.198 0.211 0.219
## RN_:(WHO_)S -0.423 0.200 0.211 0.221 0.195
## RN_:(WHO_)W -0.456 0.220 0.231 0.238 0.213
## RN_:(WHO_)S
## RN_:(WHO_)AF
## RN_:(WHO_)AM
## RN_:(WHO_)EM
## RN_:(WHO_)EU
## RN_:(WHO_)S
## RN_:(WHO_)W 0.215
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
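The conclusion is broadly similar: the slopes for AMRO, EURO and SEARO remain significant, although the slope for EMRO is no longer significant at the 0.05 level.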
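A random intercept for month can be sketched as follows. Note the swap: the report's output comes from lmerTest/lme4, whereas this sketch uses `lme()` from the nlme package, which ships with standard R installations. The data are simulated and every numeric value is an assumption made for illustration.

```r
library(nlme)  # used here instead of lmerTest; an assumption of this sketch

set.seed(3)
regions <- c("AFRO", "AMRO", "EMRO", "EURO", "SEARO", "WPRO")
months  <- sprintf("m%02d", 1:14)
d <- expand.grid(WHO_region = regions, month = months)

month_shift <- rnorm(14, sd = 5)                  # simulated month-level intercept shifts
slope <- c(0.03, 0.02, 0.025, 0.02, 0.015, 0.01)  # illustrative slopes per region
d$cases  <- runif(nrow(d), 100, 1000)
d$deaths <- 10 + slope[as.integer(d$WHO_region)] * d$cases +
  month_shift[as.integer(d$month)] + rnorm(nrow(d), sd = 2)

# Fixed part: intercept plus one slope per region; random part: intercept per month
fit <- lme(deaths ~ cases:WHO_region, random = ~ 1 | month, data = d)
print(fixef(fit))
print(VarCorr(fit))
```

As in the fitted formula shown above, this specification keeps a fixed intercept, so its slopes are not directly comparable with those of the no-intercept linear model.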
We can check the basic assumptions of our model and run some diagnostics.
According to the diagnostic plots, the model does not satisfy the basic assumptions well, especially normality, so we consider transforming the response. For convenience, we remove the observations in which new deaths are 0. Based on the Box-Cox procedure, we use log(RNew_deaths). The summary and plots of the transformed model follow.
##
## Call:
## lm(formula = RNew_deaths ~ RNew_cases:factor(WHO_region) - 1,
## data = data_change)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7476 -0.1285 3.6686 5.6639 9.6653
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## RNew_cases:factor(WHO_region)AFRO 2.319e-05 4.937e-06 4.697 1.28e-05 ***
## RNew_cases:factor(WHO_region)AMRO 1.903e-06 2.944e-07 6.463 1.17e-08 ***
## RNew_cases:factor(WHO_region)EMRO 1.417e-05 2.501e-06 5.666 3.01e-07 ***
## RNew_cases:factor(WHO_region)EURO 1.953e-06 3.442e-07 5.673 2.93e-07 ***
## RNew_cases:factor(WHO_region)SEARO 5.576e-06 1.047e-06 5.323 1.17e-06 ***
## RNew_cases:factor(WHO_region)WPRO 4.272e-05 9.497e-06 4.498 2.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.18 on 70 degrees of freedom
## Multiple R-squared: 0.7162, Adjusted R-squared: 0.6919
## F-statistic: 29.45 on 6 and 70 DF, p-value: < 2.2e-16
All of the coefficients are significant and, according to the Q-Q plot, the residuals appear approximately normal. However, the Residuals vs Fitted plot shows a clear linear pattern, which may indicate heteroscedasticity or a remaining linear relationship that the model does not capture.
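The Box-Cox step mentioned above can be sketched with `boxcox()` from the MASS package. The example below is a simplified single-predictor setting with simulated data, not the report's region model; the data are generated to be log-linear, so the estimated lambda should fall near 0.

```r
library(MASS)  # provides boxcox()

set.seed(4)
x <- runif(80, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(80, sd = 0.3))  # strictly positive, log-linear by construction

# y = TRUE and qr = TRUE keep the components boxcox() needs on the fitted object
fit <- lm(y ~ x, y = TRUE, qr = TRUE)
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Lambda that maximizes the profile log-likelihood on the grid
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# A lambda near 0 supports a log transform of the response
fit_log <- lm(log(y) ~ x)
```

Box-Cox requires a strictly positive response, which is why observations with zero new deaths have to be removed (or offset) before the transformation is chosen.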
Because our existing model is limited by the number of parameters and cannot rule out selection bias, we cannot give a sound causal interpretation here; we can only draw conclusions about association. First, the scatter plot suggests a relatively strong positive linear relationship between new deaths and new cases overall: as new cases increase, new deaths also increase. One explanation is that an increase in new cases indicates that the epidemic is ongoing and has not been effectively contained. Deaths arise from diagnosed cases, so a large number of new cases provides a large base for new deaths. According to the model analysis, among the six WHO regions, Africa and the Eastern Mediterranean do not have many new cases or new deaths, but the association between the two variables is very strong, which may be related to poor local medical conditions and other factors. The Americas and Europe have the most new cases and new deaths, and the correlation between them is relatively strong. This may be related to people's living habits: people value freedom of movement and the countries are close to each other, so travel between countries is easy, which may further aggravate the epidemic. In South-East Asia, there are not many new cases or new deaths, and the correlation between them is not very strong. This may be due to the region's climate: it lies in the tropics and temperatures are high, which may be less conducive to the spread and lethality of Covid-19. In the Western Pacific region, new cases and new deaths are the fewest, and the correlation between them is the weakest. This may be related to the policies of these countries, since strict policies are likely to help control the epidemic.
In this report, we first find from the interactive visualization maps that new cases and new deaths show a similar trend in each region. Second, the ANOVA results show that the numbers of new cases and new deaths differ across months and across regions. We then use a regression model to explore the relationship between new deaths and new cases, and find that monthly new deaths have a linear relationship with the interaction between monthly new cases and region. Because new deaths differ significantly between regions, regions with high deaths may be able to learn from regions with low deaths. In addition, to reduce the number of deaths, each region should control the number of new cases.
There is still room for improvement in this report. First, the regression model in the inferential analysis does not satisfy the basic assumptions well, so further transformations of the model could make it more reasonable. Second, more data are needed: we think that population, infection rate and the specific strategies of different regions may also affect the number of new deaths. Further tests with these new variables could help us better understand the impact of Covid-19. Every life is precious, and we hope that further research will build on this work.
Chatters, Linda M.; Taylor, Harry Owen; Taylor, Robert Joseph. (2020). Older Black Americans during COVID-19: Race and Age Double Jeopardy. CA: SAGE Publications.
Causey, J.; Harnack-Eber, A.; Huie, F.; Lang, R.; Liu, Q.; Ryu, M.; Shapiro, D. (2020). COVID-19 Transfer, Mobility, and Progress: First Look Fall 2020 Report. VA: National Student Clearinghouse Research Center.
The full code and data set we used are available at: https://github.com/xiaoyi-xu/project3
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MASS_7.3-51.6 maps_3.3.0 lmerTest_3.1-3 lme4_1.1-25
## [5] Matrix_1.2-18 qwraps2_0.5.0 gplots_3.1.1 RCurl_1.98-1.2
## [9] XML_3.99-0.5 sparkline_2.0 DT_0.17 echarts4r_0.3.3
## [13] plotly_4.9.3 lubridate_1.7.9 forcats_0.5.0 stringr_1.4.0
## [17] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
## [21] tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0 RJSONIO_1.3-1.4
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-148 bitops_1.0-6 fs_1.5.0
## [4] httr_1.4.2 numDeriv_2016.8-1.1 tools_4.0.2
## [7] backports_1.1.10 R6_2.4.1 KernSmooth_2.23-17
## [10] DBI_1.1.0 lazyeval_0.2.2 colorspace_1.4-1
## [13] withr_2.3.0 tidyselect_1.1.0 curl_4.3
## [16] compiler_4.0.2 cli_2.1.0 rvest_0.3.6
## [19] xml2_1.3.2 labeling_0.3 caTools_1.18.1
## [22] scales_1.1.1 digest_0.6.25 minqa_1.2.4
## [25] rmarkdown_2.4 pkgconfig_2.0.3 htmltools_0.5.1.1
## [28] highr_0.8 dbplyr_1.4.4 fastmap_1.1.0
## [31] htmlwidgets_1.5.3 rlang_0.4.10 readxl_1.3.1
## [34] rstudioapi_0.11 shiny_1.6.0 farver_2.0.3
## [37] generics_0.1.0 jsonlite_1.7.1 crosstalk_1.1.1
## [40] gtools_3.8.2 magrittr_1.5 Rcpp_1.0.5
## [43] munsell_0.5.0 fansi_0.4.1 lifecycle_0.2.0
## [46] stringi_1.5.3 yaml_2.2.1 grid_4.0.2
## [49] blob_1.2.1 promises_1.1.1 crayon_1.3.4
## [52] lattice_0.20-41 haven_2.3.1 splines_4.0.2
## [55] hms_0.5.3 knitr_1.30 pillar_1.4.6
## [58] boot_1.3-25 reprex_0.3.0 glue_1.4.2
## [61] evaluate_0.14 data.table_1.13.2 modelr_0.1.8
## [64] nloptr_1.2.2.2 vctrs_0.3.4 httpuv_1.5.5
## [67] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
## [70] xfun_0.18 mime_0.9 xtable_1.8-4
## [73] broom_0.7.1 later_1.1.0.1 viridisLite_0.3.0
## [76] statmod_1.4.35 ellipsis_0.3.1